Overview
Crop yield is influenced by several factors, including temperature, NPK (Nitrogen, Phosphorus, Potassium) levels in the soil and fertilizer application. Optimal Temperature ranges vary depending on the crop, but generally, excessive heat or cold can negatively impact growth and yield.
NPK levels play a crucial role in plant nutrition, with nitrogen promoting vegetative growth, Phosphorus supporting root development and flowering and Potassium enhancing overall plant health and disease resistance.
Fertilizer application can improve crop yield by replenishing essential nutrients in the soil, but its important to use the right type and amount of fertilizer for the specific crop and soil conditions.
Data Preparation
# Load library
library(readxl)
library(dplyr)
library(janitor)
library(naniar)
library(ggplot2)
library(plotly)
library(tidyr)
# Import data
data <- read_excel("./00_raw_data/crop yield data sheet.xlsx")
# Rename column names
data <- data %>% rename(`Rainfall (mm)` = `Rain Fall (mm)`, 'Yield (Q/acre)' = `Yeild (Q/acre)`, `Temperature` = Temperatue)
# Clean column names
data <- data %>% clean_names()
# Check for NAs
colnames(data)[apply(data, 2, anyNA)]
[1] "rainfall_mm" "fertilizer" "temperature" "nitrogen_n" "phosphorus_p" "potassium_k" "yield_q_acre"
# Check the structure of the data
str(data)
tibble [109 × 7] (S3: tbl_df/tbl/data.frame)
$ rainfall_mm : num [1:109] 1230 480 1250 450 1200 ...
$ fertilizer : num [1:109] 80 60 75 65 80 70 71 65 77 50 ...
$ temperature : chr [1:109] "28" "36" "29" "35" ...
$ nitrogen_n : num [1:109] 80 70 78 70 79 74 77 67 78 60 ...
$ phosphorus_p: num [1:109] 24 20 22 19 22 22 21 18 23 18 ...
$ potassium_k : num [1:109] 20 18 19 18 19 16 20 15 20 15 ...
$ yield_q_acre: num [1:109] 12 8 11 9 11 10 11 7 12 6 ...
# Convert the temperature to numeric
data$temperature <- as.numeric(data$temperature, data, na.rm=T)
Warning: NAs introduced by coercion
A threshold was defined to do know the percentage of missing data.
# Calculate the NA percentage per column
na_percent <- colSums(is.na(data)) / nrow(data) * 100
# Filter columns above a certain threshold
threshold <- 30
high_na_columns <- na_percent[na_percent > threshold]
print(high_na_columns)
named numeric(0)
print(na_percent)
rainfall_mm fertilizer temperature nitrogen_n phosphorus_p potassium_k yield_q_acre
9.174312 9.174312 9.174312 9.174312 9.174312 9.174312 9.174312
The high_na_columns returned named numeric(0), which means no column
have more than 30% missing values. The full na_percent vector shows all
columns have around 9.17% or less NA values.
Data Visualisation on Missing Data
# Visualise missing Na's using naniar package
gg_miss_var(data, show_pct = T)
# Using visdat to show missing data
library(visdat)
# Visualise the structure and missing values
vis_miss(data)
# Using manual plot to visualise data
na_percent <- colSums(is.na(data)) / nrow(data) * 100
na_df <- data.frame(column = names(na_percent), na_percent = na_percent)
ggplot(na_df, aes(x = reorder(column, -na_percent), y = na_percent)) +
geom_col(fill = "tomato") +
labs(title = "Percentage of Missing Data per Column", x= "Column", y="Missing Data (%)") +
theme_minimal() +
coord_flip()
The visualisations shows the percentage of missing values for each variable in the dataset. However, the visualisations alone cannot directly determine whether the data is Missing Completely at Random (MCAR), Missing at Random (MAR) or Missing Not at Random (MAR). To determine the missing data mechanism, statistical tests and analyses were performed.
library(MissMech)
# Ensure all data is numeric
numeric_data <- data[sapply(data, is.numeric)]
# Run the test
TestMCARNormality(numeric_data)
Warning: 10 Cases with all variables missing have been removed
from the data.Error in TestMCARNormality(numeric_data) :
More than one missing data pattern should be present.
The result from TestMCARNormality indicates more than one missing data pattern should be present, meaning that only one missing data pattern exists in the dataset, which confirms the data visualisations of missing data. Using the “which” function, exactly 10 row numbers are seen, confirming same rows, same variables = 1 pattern. It can reasonably be assume MCAR. The practical choice is to use listwise deletion (the missing data is a small amount) or multiple imputation using missForest (Random Forest Imputation)
# library(missForest)
# Keep both numeric and factor columns
# mixed_data <- data[sapply(data, function(x) is.numeric(x) || is.factor(x))]
# The data structure has been checked to ensure columns are numeric
# Remove fully missing rows first
# numeric_data <- numeric_data[rowSums(is.na(numeric_data)) <ncol(numeric_data),]
# numeric_data <- as.data.frame(numeric_data)
# Run imputation
# imputed_data <- missForest(numeric_data)
# data_filled <-imputed_data$ximp
Using listwise deletion approach
clean_data <- na.omit(data)
Exploratory Data Analysis
Summary Statistics of Yield
summary(clean_data$yield_q_acre)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5.500 7.000 9.000 9.051 11.000 12.000
The minimum and maximum yield are 5.5 and 12 q/acre respectively. 25% of the yield (1st quartile) is below 7 q/acre and 75% of the yield (3rd quartile) is below 11 q/acre or 25% of the yield is above 11 q/acre, whilst the average yield is 9 q/acre.
The data distribution is slightly positively skewed from the comparison values of the median and mean.
Summary Statistics of Rainfall
summary(clean_data$rainfall_mm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
400.0 450.0 1150.0 849.8 1237.5 1300.0
The minimum and maximum rainfall are 400 and 1300 mm respectively. 25% of the rainfall (1st quartile) is below 450 mm and 75% of the rainfall (3rd quartile) is below 1237 mm, whilst the average rainfall is 849 mm.
The rainfall distribution is negatively skewed from the comparison values of the median and mean.
Data Visualisation of Yield vs Rainfall Using Scatter Plot
ggplot(data = clean_data, mapping = aes(x = rainfall_mm, y = yield_q_acre)) +
geom_point(color = 'darkblue') +
geom_smooth(method = 'lm', se = FALSE, color ="red") +
labs(title = "Yield vs Rainfall", x = "Rainfall (mm)", y = "Yield (Q/acre)")+
theme_minimal()
Data Visualisation of Yield vs Temperature Using Scatter Plot
ggplot(data = clean_data, mapping = aes(x = temperature, y = yield_q_acre)) +
geom_point(color = "darkblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = " Yield vs Temperature", x = "Temperature", y = "Yield (Q/acre)" ) +
theme_minimal()
Data Visualisation of Yield by Rainfall and
Temperature
ggplot(data = clean_data, mapping = aes(x = rainfall_mm, y = yield_q_acre, color = temperature)) +
geom_point(size = 3) +
labs(title = "Yield vs Rainfall (Coloured by Temperature)", x = "Rainfall (mm)", y = "Yield (Q/acre)") +
scale_color_gradient(low = "blue", high = "red") +
theme_minimal()
Interactive Visualisation using plotly
ggplot(data = clean_data, mapping = aes(x = rainfall_mm, y = yield_q_acre, color = temperature)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black")+
theme_minimal()
Since fertilizer has many unique values but is not fully continuous (like 50, 52, 55…) it’s perfect for boxplot analysis and line plot analysis.
Boxplot of Yield by Fertilizer
ggplot(clean_data, aes(x = as.factor(fertilizer), y = yield_q_acre)) +
geom_boxplot(fill = "lightblue") +
labs(title = "Crop Yield by Fertilizer Level",
x = "Fertilizer", y = "Yield (Q/acre)"
) + theme_minimal()
Line Plot of Mean Yield vs Fertilizer
average_yield <- clean_data %>%
group_by(fertilizer) %>%
summarise(mean_yield = mean(yield_q_acre))
ggplotly(ggplot(data = average_yield, mapping = aes(x = fertilizer, y =mean_yield)) +
geom_line(color = "forestgreen") +
geom_point( size = 2) +
labs(title = "Average Yield by Fertilizer", x = "Fertilizer", y = "Mean Yield (Q/acre)") + theme_minimal())
Scatter Plot of Yield by Fertilizer and Rainfall
ggplot(data = clean_data, mapping = aes(x = fertilizer, y = yield_q_acre, color = rainfall_mm)) + geom_point(size = 3) +
labs(title = "Yield by Fertilizer and Rainfall", x = "Fertilizer", y = "Yield (Q/acre)") +
scale_color_gradient(low = "lightblue", high = "darkblue") +
theme_minimal()
Analysis of Yield by Macro-nutrients (Nitrogen (N), Phosphorus (P), Potassium (K)
Macro-nutrients are essential mineral elements that plants need in relatively large amounts for healthy growth and seed production. The primary macro-nutrients are nitrogen (N), phosphorus (P) and potassium (K), often represented as NPK on fertilizer labels. Nitrogen is important for leaf and stem growth, chlorophyll production and overall plant vigour. Phosphorus is important for root development, flowering, fruiting and seed formation. Potassium is essential for overall plant health, disease resistance and fruit quality.
average_yield_nutrients <- clean_data %>%
group_by(nitrogen_n, phosphorus_p, potassium_k) %>%
summarise(average_yield_q_acre = mean(yield_q_acre, .groups = "drop"))
`summarise()` has grouped output by 'nitrogen_n', 'phosphorus_p'. You can override using the `.groups` argument.
plot_ly(
data = average_yield_nutrients,
x = ~nitrogen_n,
y = ~phosphorus_p,
z = ~potassium_k,
size = ~average_yield_q_acre,
type = 'scatter3d',
mode = 'markers',
marker = list(sizemode = 'diameter', color = ~average_yield, colorscale = 'Viridis')
)
Warning: `line.width` does not currently support multiple values.Warning: `line.width` does not currently support multiple values.
ggplotly (ggplot(average_yield_nutrients, aes(x = nitrogen_n, y = phosphorus_p, size = average_yield_q_acre, color = potassium_k)) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(2, 10)) +
labs(title = "Yield by N, P (Bubble = Yield, Color = K)",
x = "Nitrogen (N)", y = "Phosphorus (P)", size = "Avg Yield", color = "Potassium (K)") +
theme_minimal())
ggplotly (ggplot(average_yield_nutrients, aes(x = nitrogen_n, y = phosphorus_p, fill = average_yield_q_acre)) +
geom_tile(color = "white") +
scale_fill_viridis_c() +
labs(title = "Average Yield by Nitrogen and Phosphorus",
x = "Nitrogen (N)", y = "Phosphorus (P)", fill = "Avg Yield") +
theme_minimal())
Warning: data length [184] is not a sub-multiple or multiple of the number of columns [21]Warning: data length [184] is not a sub-multiple or multiple of the number of columns [21]
Diminishing Returns in Fertilizer
It is the idea that as you keep adding more fertilizer. the increase in yield gets smaller and smaller. At a point adding more might even reduce yield due to soil toxicity, soil imbalance etc.
Diminishing returns in R can be achieved using the quadratic model:
Yield=β0+β1⋅Fertilizer+β2⋅Fertilizer2+ other terms
If β2 is negative, it confirms
diminishing returns (a concave curve 📈➡️📉)
# Fit model with Fertilizer squared (quadratic term)
model_full <- lm(yield_q_acre ~ fertilizer + I(fertilizer^2) + rainfall_mm + nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
# summary to check the sign and significance of fertilizer^2
summary(model_full)
Call:
lm(formula = yield_q_acre ~ fertilizer + I(fertilizer^2) + rainfall_mm +
nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
Residuals:
Min 1Q Median 3Q Max
-2.11887 -0.47584 0.07636 0.40556 1.78920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.7229605 5.2407865 -0.901 0.369839
fertilizer -0.0758227 0.1458365 -0.520 0.604371
I(fertilizer^2) 0.0003725 0.0011700 0.318 0.750937
rainfall_mm 0.0018236 0.0004860 3.752 0.000307 ***
nitrogen_n 0.0989629 0.0312298 3.169 0.002078 **
phosphorus_p 0.1673962 0.0701215 2.387 0.019022 *
potassium_k 0.2811545 0.0777775 3.615 0.000490 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7339 on 92 degrees of freedom
Multiple R-squared: 0.8692, Adjusted R-squared: 0.8606
F-statistic: 101.9 on 6 and 92 DF, p-value: < 2.2e-16
Visualise the curve
# Plot actual data points
ggplot(clean_data, aes(x = fertilizer, y = yield_q_acre)) +
geom_point(color = "darkgreen", alpha = 0.6) +
stat_smooth(method = "lm", formula = y ~ x + I(x^2), color = "blue", size = 1.2) +
labs(title = "Diminishing Returns of Fertilizer",
x = "Fertilizer (kg/acre?)",
y = "Yield (q/acre)") +
theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.
Visualise (Fertilizer vs. Yield with control vars )
Plotting the relationship between fertilizer and yield while holding other variables constant (at their mean)
# Create a sequence of fertilizer values
fert_range <- seq(min(clean_data$fertilizer), max(clean_data$fertilizer), length.out = 100)
# Hold other variables at their mean
rain_mean <- mean(clean_data$rainfall_mm)
N_mean <- mean(clean_data$nitrogen_n)
P_mean <- mean(clean_data$phosphorus_p)
K_mean <- mean(clean_data$potassium_k)
# Create new data frame for prediction
# Create prediction data frame with correct column names
pred_data <- data.frame(
fertilizer = fert_range,
rainfall_mm = rain_mean,
nitrogen_n = N_mean,
phosphorus_p = P_mean,
potassium_k = K_mean
)
# Predict yield
pred_data$Pred_Yield <- predict(model_full, newdata = pred_data)
# Plot
ggplot() +
geom_point(data = clean_data, aes(x = fertilizer, y = yield_q_acre), alpha = 0.5, color = "gray") +
geom_line(data = pred_data, aes(x = fertilizer, y = Pred_Yield), color = "blue", size = 1.2) +
labs(title = "Diminishing Returns of Fertilizer (controlling for Rainfall & NPK)",
x = "Fertilizer",
y = "Predicted Yield") +
theme_minimal()
Confirmatory Data Analysis and Modeling
Linear Regression
lm_model <- lm(yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
summary(lm_model)
Call:
lm(formula = yield_q_acre ~ rainfall_mm + temperature + fertilizer +
nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
Residuals:
Min 1Q Median 3Q Max
-1.95489 -0.41369 -0.02473 0.44995 1.58190
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2164292 2.7693299 0.800 0.425571
rainfall_mm -0.0005803 0.0007989 -0.726 0.469489
temperature -0.1883137 0.0515511 -3.653 0.000431 ***
fertilizer -0.0272504 0.0201632 -1.351 0.179851
nitrogen_n 0.1033208 0.0282216 3.661 0.000419 ***
phosphorus_p 0.1153548 0.0669197 1.724 0.088107 .
potassium_k 0.3061350 0.0728058 4.205 6.05e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6862 on 92 degrees of freedom
Multiple R-squared: 0.8856, Adjusted R-squared: 0.8781
F-statistic: 118.7 on 6 and 92 DF, p-value: < 2.2e-16
Regression with interaction term in R
model2 <- lm(yield_q_acre ~ rainfall_mm * fertilizer + temperature + nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
summary(model2)
Call:
lm(formula = yield_q_acre ~ rainfall_mm * fertilizer + temperature +
nitrogen_n + phosphorus_p + potassium_k, data = clean_data)
Residuals:
Min 1Q Median 3Q Max
-1.95632 -0.41442 -0.03545 0.44915 1.58503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.639e+00 3.688e+00 0.716 0.476039
rainfall_mm -1.174e-03 3.488e-03 -0.336 0.737320
fertilizer -3.159e-02 3.206e-02 -0.985 0.327009
temperature -1.894e-01 5.221e-02 -3.628 0.000471 ***
nitrogen_n 1.022e-01 2.906e-02 3.517 0.000683 ***
phosphorus_p 1.153e-01 6.728e-02 1.713 0.090071 .
potassium_k 3.061e-01 7.319e-02 4.182 6.63e-05 ***
rainfall_mm:fertilizer 8.041e-06 4.601e-05 0.175 0.861640
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6899 on 91 degrees of freedom
Multiple R-squared: 0.8856, Adjusted R-squared: 0.8769
F-statistic: 100.7 on 7 and 91 DF, p-value: < 2.2e-16
Diagnostic Plots
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model2)
Rainfall and Fertilizer may only show their power under non-linear or conditional effects (e.g., diminishing returns, thresholds). This is exactly where a Decision Tree or Random Forest would shine; it doesn’t assume additive or linear relationships.
Decision Tree Model
library(rpart)
library(rpart.plot)
# Fit the decsion tree
tree_model <- rpart(formula = yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = clean_data, method = "anova")
# Visualise the tree
rpart.plot(x = tree_model, type = 2, fallen.leaves = TRUE, extra = 101)
The root node of the decision is the first split and shows the most influential feature in predicting crop yield. In this model, rainfall is at the root, meaning it has the highest information gain and is the most important factor in determining crop yield. The subsequent splits provide additional insights into how different combination of features influence the prediction of crop yield.
Random Forest
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
# Fit the random forest model
rf_model <- randomForest(formula = yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = clean_data, importance = TRUE, ntree = 500)
# View variable importance
importance(rf_model)
%IncMSE IncNodePurity
rainfall_mm 17.08852 103.34336
temperature 20.40790 107.78519
fertilizer 10.55976 49.62138
nitrogen_n 12.04265 53.38558
phosphorus_p 12.96450 13.08540
potassium_k 16.24468 38.43646
varImpPlot(rf_model)
Variable Important Metrics
%IncMSE: This tells how much the mean squared error increases when that variable is permuted; higher means more important.
IncNodePurity: Measures how much a variable improves node purity (i.e., splits that reduce variance); also higher means more important.
Based on the output, temperature is the most impactful predictor of
crop yield. Rainfall and Potassium are also strong predictors.
Fertilizer has the least influence in the randomForest model.
For better visualisation than the VarImpPlot()
# Tidy up the variable importance
var_imp <- importance(rf_model) %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "Variable") %>%
arrange(desc(`%IncMSE`))
# Round for readability
var_imp <- var_imp %>%
mutate(across(-Variable, round, 2))
print(var_imp)
# Plot variable importance using %IncMSE
ggplot(var_imp, aes(x = reorder(Variable, `%IncMSE`), y = `%IncMSE`)) +
geom_col(fill = "#2c7fb8") +
coord_flip() +
labs(
title = "Variable Importance from Random Forest",
x = "Variable",
y = "% Increase in MSE"
) +
theme_minimal()
Model Performance Comparison
The model performance will be compared using standard regression metrics:
R2 - R squared
RMSE - Root Mean Squared Error
MAE - Mean Absolute Error
library(Metrics)
library(caret)
# yield_q_acre is my response variable
# Get true values
actual <- clean_data$yield_q_acre
# Linear regression model
pred_lm <- predict(lm_model, newdata = clean_data)
# Decision tree predictions
pred_tree <- predict(tree_model, newdata = clean_data)
# Random forest prediction
pred_rf <- predict(rf_model, newdata = clean_data)
# --------------------------------------------------
# CREATE A PERFORMANCE TABLE
# --------------------------------------------------
# Function to compute metrics
get_metrics <- function(actual, predicted) {
data.frame(
R2 = R2(predicted, actual),
RMSE = rmse(actual, predicted),
MAE = mae(actual, predicted)
)
}
# Combine all into one table
performance_comparison <- rbind(
Linear_Regression = get_metrics(actual = actual, predicted = pred_lm),
Decision_Tree = get_metrics(actual = actual, predicted = pred_tree),
Random_Forest = get_metrics(actual = actual, predicted = pred_rf)
)
print(performance_comparison)
Higher R2 and lower RMSE/MAE = better performance
performance_comparison$Model <- rownames(performance_comparison)
performance_long <- pivot_longer(performance_comparison, cols = c(R2, RMSE, MAE), names_to = "Metric", values_to = "Value")
ggplot(performance_long, aes(x = Model, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Model Performance Comparison", y = "Metric Value", x = "") +
theme_minimal() +
scale_fill_brewer(palette = "Set1")
Using train/test split to validate them properly
# Split the data into 70% training and 30% testing
set.seed(123)
split_index <- createDataPartition(clean_data$yield_q_acre, p = 0.7, list = FALSE)
train_data <- clean_data[split_index, ]
test_data <- clean_data[-split_index, ]
# Fit Models on Training Data
# Linear Regression
lm_model <- lm(yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = train_data)
# Decision Tree
tree_model <- rpart(yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = train_data, method = "anova")
# Random Forest
rf_model <- randomForest(yield_q_acre ~ rainfall_mm + temperature + fertilizer + nitrogen_n + phosphorus_p + potassium_k, data = train_data, importance = TRUE, ntree = 500)
# Predict on Test Set & Evaluate
# True values
actual <- test_data$Yield_Q_acre
Warning: Unknown or uninitialised column: `Yield_Q_acre`.
# Predictions
pred_lm <- predict(lm_model, newdata = test_data)
pred_tree <- predict(tree_model, newdata = test_data)
pred_rf <- predict(rf_model, newdata = test_data)
# Metrics function
get_metrics <- function(actual, predicted) {
data.frame(
R2 = R2(predicted, actual),
RMSE = rmse(actual, predicted),
MAE = mae(actual, predicted)
)
}
actual <- as.numeric(test_data$yield_q_acre)
pred_lm <- as.numeric(pred_lm)
pred_tree <- as.numeric(pred_tree)
pred_rf <- as.numeric(pred_rf)
# Combine into one table
performance_comparison_1 <- rbind(
Linear_Regression = get_metrics(actual, pred_lm),
Decision_Tree = get_metrics(actual, pred_tree),
Random_Forest = get_metrics(actual, pred_rf)
)
print(performance_comparison)
NA
Visualise the performance
performance_comparison_1$Model <- rownames(performance_comparison_1)
performance_long <- pivot_longer(performance_comparison_1, cols = c(R2, RMSE, MAE), names_to = "Metric", values_to = "Value")
ggplot(performance_long, aes(x = Model, y = Value, fill = Metric)) +
geom_col(position = "dodge") +
labs(title = "Model Performance Comparison (Test Set)", y = "Metric Value", x = "") +
theme_minimal() +
scale_fill_brewer(palette = "Set1")
Export clean data
write.csv(x = clean_data, file = "./02_clean_data/clean_data.csv", row.names = F)